Approximate Joins for Relational Data

Author

Panos Vassiliadis

Abstract

Krommydas, Ioannis, Evagelos, Georgia. MSc, Computer Science Department, University of Ioannina, Greece. June 2008. Approximate Joins for Relational Data. Thesis Supervisor: Vassiliadis Panos.

Relational databases often contain duplicate data entries. Duplicates arise for a variety of reasons, such as typographical errors, multiple conventions for recording database fields, or other sources of noise. Duplicate detection is a crucial procedure, especially for large databases. In this thesis, we present a method that extends the state-of-the-art method for duplicate detection. Given a database holding valid data, we classify each input tuple as either a new tuple or an existing tuple. The proposed method uses an effective algorithm to determine a set of candidate reference tuples. For each candidate reference tuple, we use appropriate similarity metrics to decide whether the input tuple matches it. The whole procedure is accelerated by trie data structures that cache frequent input tuples. Finally, we present a number of experiments evaluating the effectiveness of our method and compare it with the state-of-the-art method.
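The pipeline outlined in the abstract (candidate selection, similarity scoring, and a trie cache for repeated input tuples) can be illustrated with a minimal Python sketch. The abstract does not name the candidate-selection algorithm or the similarity metrics, so everything below is an assumption made for illustration: the q-gram inverted index, the Jaccard score, the 0.6 threshold, and the names `qgrams`, `jaccard`, `TrieCache`, and `ApproximateMatcher` are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Optional, Set, Tuple

Row = Tuple[str, ...]
Result = Tuple[str, Optional[int]]  # ("existing", reference index) or ("new", None)


def qgrams(text: str, q: int = 3) -> Set[str]:
    """Return the set of padded, lowercased q-grams of a string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class TrieCache:
    """Character trie that maps an already seen input key to its result."""

    def __init__(self) -> None:
        self.root: dict = {}

    def get(self, key: str) -> Optional[Result]:
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None  # key was never stored
        return node.get("")  # "" marks end-of-key; it is never a character

    def put(self, key: str, value: Result) -> None:
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = value


class ApproximateMatcher:
    """Classify input tuples against reference tuples (illustrative only)."""

    def __init__(self, reference: List[Row], threshold: float = 0.6, q: int = 3):
        self.threshold = threshold  # assumed cut-off, not from the thesis
        self.q = q
        self.grams = [qgrams(" ".join(r), q) for r in reference]
        # Inverted index: q-gram -> indices of reference tuples containing it.
        self.index: Dict[str, Set[int]] = {}
        for i, gs in enumerate(self.grams):
            for g in gs:
                self.index.setdefault(g, set()).add(i)
        self.cache = TrieCache()

    def classify(self, row: Row) -> Result:
        key = "\x1f".join(row)  # unit separator keeps field boundaries distinct
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        gs = qgrams(" ".join(row), self.q)
        # Candidates: reference tuples sharing at least one q-gram with the input.
        candidates: Set[int] = set()
        for g in gs:
            candidates |= self.index.get(g, set())
        best_i, best_s = None, 0.0
        for i in candidates:
            s = jaccard(gs, self.grams[i])
            if s > best_s:
                best_i, best_s = i, s
        if best_i is not None and best_s >= self.threshold:
            result: Result = ("existing", best_i)
        else:
            result = ("new", None)
        self.cache.put(key, result)
        return result


reference_rows = [("John Smith", "Ioannina"), ("Maria Papadopoulou", "Athens")]
matcher = ApproximateMatcher(reference_rows)
print(matcher.classify(("Jon Smith", "Ioannina")))
print(matcher.classify(("Zed Quux", "Oslo")))
```

In this sketch a misspelled input such as `("Jon Smith", "Ioannina")` shares most of its q-grams with the first reference tuple and is reported as existing, while an unrelated tuple has no candidates and is reported as new. Repeating a call with the same tuple is answered from the trie without recomputing any similarity score.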


Similar resources

Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

Integrating heterogeneous data from sources as diverse as web pages, digital libraries, knowledge bases, the Semantic Web and databases is an open problem. The ultimate aim of our work is to be able to query such heterogeneous data sources as if their data were conveniently held in a single relational database. Pursuant to this aim, we propose a generalisation of joins from the relational datab...


Approximate String Joins

String data is ubiquitous and is commonly used to correlate (or join) entities across autonomous, heterogeneous databases. The main challenge is to effectively deal with the noisy nature of string data, due to, for example, transcription errors, incomplete information, and multiple conventions for recording string valued attributes. Commercial databases do not support approximate string joins d...


Scoped and Approximate Queries in a Relational Grid Information Service

We are developing a grid information service, RGIS, that is based on the relational data model. RGIS supports complex queries written in SQL that search for compositions (using joins) of resources. For example, we might ask it to find a Linux cluster with a certain bisection bandwidth and total memory. Such queries can be expensive to execute, however, and so we have developed several approache...


Approximate String Joins in a Database (Almost) for Free

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not suppo...


Using q-grams in a DBMS for Approximate String Processing

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a ...




Publication date: 2008
